About the project

Hi! I’ve used R before, but the first week has been great to repeat some syntax. I do expect to learn Markdown since it has been in my to-do list for years. This course was recommended to me by my doctoral school.
This is my project


R Studio Exercise 2: Regression and model validation

November 11, 2019
This week I have performed a data wrangling on a dataset, analyzed and interpreted my regression model and did some nice plots.

Looking at the dataset

Reading a file and attaching the needed libraries (dplyr gives some little error but works fine):

setwd("~/IODS-project/data")
library(ggplot2)
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
my_learning <- read.table("learning2014.csv")

Exploring structure and dimensions:

str(my_learning)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

We have numeric variables (deep = deep learning, stra = strategic learning, surf = surface learning), 3 integer variables (points, age, attitude) and 1 factor variable (gender).
Lets also look at dimensions:

dim(my_learning)
## [1] 166   7

The dataset has 166 observations and 7 variables.

Graphs

Attitude and Points coloured by gender:

ggplot(my_learning, aes(x = Attitude, y = Points, col = gender)) +
  geom_point()

Summaries of variables:

summary(my_learning)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

We have twice more males than females.

Distribution of other variables:
Age

hist(my_learning$Age, xlab = "Age", main = "Distribution of Age")

Distribution of age is not normal.

Attitude

hist(my_learning$Attitude, xlab = "Attitude", main = "Distribution of Attitude")

Distribution of attitude is normal.

Deep learning

hist(my_learning$deep, xlab = "Deep learning", main = "Distribution of Deep learning")

Distribution of deep learning is normal.

Strategic learning

hist(my_learning$stra, xlab = "Strategic learning", main = "Distribution of Strategic learning")

Distribution of strategic learning is not normal.

Surface learning

hist(my_learning$surf, xlab = "Surface learning", main = "Distribution of Surface learning")

Distribution of surface learning is normal.

Points

hist(my_learning$Points, xlab = "Points", main = "Distribution of Points")

Distribution of points is normal.

Now lets see relationship between variables:

ggpairs(my_learning, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

Conclusion on variables: correlation is actually very low for all variables, but the highest is for attitude&points.

Regression model

Now I do multiple regression model with attutude, strategic questions and deep learning questions; exam points is a dependent variable:

model <- lm(Points ~ Attitude + stra + deep, data = my_learning)

Summary of the model:

summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = my_learning)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.39145    3.40775   3.343  0.00103 ** 
## Attitude     0.35254    0.05683   6.203 4.44e-09 ***
## stra         0.96208    0.53668   1.793  0.07489 .  
## deep        -0.74920    0.75066  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

Interpretation of model results and stats:
1. P value is very low, which means that, at least, one of the predictor variables is significantly related to the outcome variable;
2. Only attitude can explain points well, that is, changes in attitude will affect points.

As others are not significating, we can remove them from the model:

modelnostranodeep <- lm(Points ~ Attitude, data = my_learning)
summary(modelnostranodeep)
## 
## Call:
## lm(formula = Points ~ Attitude, data = my_learning)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Conclusion: the best model is the last one (it minimizes the prediction errors = residuals).

I am going back to the original model:

model <- lm(Points ~ Attitude + stra + deep, data = my_learning)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = my_learning)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.39145    3.40775   3.343  0.00103 ** 
## Attitude     0.35254    0.05683   6.203 4.44e-09 ***
## stra         0.96208    0.53668   1.793  0.07489 .  
## deep        -0.74920    0.75066  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

Relationship between variables and the target: only changes in attitude affect the exam points.
R = 0.2097, which means that only 20% of of the variance in the points can be explained by chosen variables.

Model plots

So our model is:

model <- lm(Points ~ Attitude + stra + deep, data = my_learning)

First, plot of Residuals vs Fitted Values: used to explore whether errors depend on explanatory variables:

plot(model, which = c(1))

Result: our plot looks reasonable.

Second, Normal QQplot: used to explore whether the errors are normally distributed:

plot(model, which = c(2))

Result: our QQplot looks reasonable.

Third, plot of Residuals vs Leverage: identifies which observations have unusually high impact on the model:

plot(model, which = c(5))

Result: our plot looks bad: there is an outlier and leverage is high.

Conclusion: we should remove outliers.


R Studio Exercise 3: Logistic regression

November 13, 2019
This week I have performed a data wrangling on a joined dataset and did logistic regression.

Our dataset

Reading a file and attaching the needed libraries (dplyr gives some little error but works fine):

setwd("~/IODS-project/data")
library(ggplot2)
library(dplyr)
library(boot)
joineddata <- read.table("joineddata.csv")

Printing out variable names:

colnames(joineddata)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This is data from two questionnaires related to student performance (including alcohol consumption).
The dataset consists of 382 observations (rows) and 35 variables (columns).

Lets hypothesise

  1. Alcohol consumption and gender are related: males consume more alcohol, than females.
  2. Alcohol consumption and first period grade are related: greater first grade means less alcohol consumption for both genders.
  3. Alcohol consumption and second period grade are related: greater second grade means less alcohol consumption for both genders.
  4. Alcohol consumption and third period grade are related: greater third grade means less alcohol consumption for both genders.

Exploring the variables

Before looking at variables, lets look at the alcohol consumption distribution:

plot(density(joineddata$alc_use), main = "Distribution of alcohol consumption")

We can see that the consumption is mostly low across all dataframe.

Now lets do some numerical exploration:

Gender

summary(joineddata$sex)
##   F   M 
## 198 184

We can see relatively equal amount of males and females.

First period grades

summary(joineddata$G1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   10.00   12.00   11.49   14.00   18.00
hist(joineddata$G1, main = "Distribution of first period grades", xlab = "Grade")

Distribution of first period grades is normal.

Second period grades

summary(joineddata$G2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   10.00   12.00   11.47   14.00   18.00
hist(joineddata$G2, main = "Distribution of second period grades", xlab = "Grade")

Distribution of second period grades is normal.

Final grades

summary(joineddata$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.46   14.00   18.00
hist(joineddata$G3, main = "Distribution of final grades", xlab = "Grade")

Distribution of final grades is normal.

Now it’s time for graphical exploration:

Hypothesis 1: males consume more alcohol, than females.

g1 <- ggplot(data = joineddata, aes(x = alc_use, fill = sex))
g1 + geom_bar() + facet_wrap("sex")

The bar plot shows that much more females consume alcohol than males, however, they mostly do it rare.
On the other hand, much less males consume alcohol, but if they do - they do it regularly.

g1 <- ggplot(data = joineddata, aes(x = high_use, fill = sex))
g1 + geom_bar() + facet_wrap("sex")

This bar graph confirms the previous explanation, women don’t consume much.
Conclusion: my hypothesis can be partly confirmed.

Hypothesis 2: greater first grade means less alcohol consumption for both genders.

g3 <- ggplot(joineddata, aes(x = high_use, y = G1, col = sex))
g3 + geom_boxplot() 

There is no difference between girls whose consumption is high or low.
However, boys who consume more have lower grades.
Conclusion: my hypothesis can be partly confirmed (for boys).

Hypothesis 3: greater second grade means less alcohol consumption for both genders.

g4 <- ggplot(joineddata, aes(x = high_use, y = G2, col = sex))
g4 + geom_boxplot() 

Visually there is almost no difference between girls, but it may change after removing an outlier.
Visually, boys who consume more have lower grades.
Conclusion: my hypothesis can be partly confirmed (for boys).

Hypothesis 4: greater final grade means less alcohol consumption for both genders.

g5 <- ggplot(joineddata, aes(x = high_use, y = G3, col = sex))
g5 + geom_boxplot() 

There is no difference between girls, but it may change after removing an outlier.
Boys who consume more have lower grades, but may also change after removing an outlier.
Conclusion: not certain, needs statistical tests.

Logistic regression

m <- glm(high_use ~ sex + G1 + G2 + G3, data = joineddata, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + G1 + G2 + G3, family = "binomial", 
##     data = joineddata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2828  -0.8657  -0.6845   1.2340   2.0382  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.140881   0.530729   0.265 0.790664    
## sexM         0.883365   0.233216   3.788 0.000152 ***
## G1          -0.120940   0.094241  -1.283 0.199385    
## G2          -0.010811   0.116916  -0.092 0.926328    
## G3           0.002673   0.086031   0.031 0.975211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 442.10  on 377  degrees of freedom
## AIC: 452.1
## 
## Number of Fisher Scoring iterations: 4

The model can be interpreted as follows:
1. Being a male increases the chance of high alcohol consumption by 0.88 (significant)
2. Each one-unit change in G1 will decrease the log odds of high alcohol consumption by -0.12 (not significant)
3. Each one-unit change in G2 will decrease the log odds of high alcohol consumption by -0.10 (not significant)
4. Each one-unit change in G3 will increase the log odds of high alcohol consumption by 0.00 (not significant)

Lets look at model’s coefficients:

coef(m)
## (Intercept)        sexM          G1          G2          G3 
##  0.14088116  0.88336501 -0.12093983 -0.01081064  0.00267323

And odds ratios:

OR <- coef(m) %>% exp
OR
## (Intercept)        sexM          G1          G2          G3 
##   1.1512878   2.4190261   0.8860873   0.9892476   1.0026768

And confidence intervals:

CI <- confint(m) %>% exp
## Waiting for profiling to be done...
CI
##                 2.5 %   97.5 %
## (Intercept) 0.4066194 3.274532
## sexM        1.5383105 3.843873
## G1          0.7350499 1.065697
## G2          0.7864414 1.246077
## G3          0.8475937 1.190283

Now lets look at all them:

cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 1.1512878 0.4066194 3.274532
## sexM        2.4190261 1.5383105 3.843873
## G1          0.8860873 0.7350499 1.065697
## G2          0.9892476 0.7864414 1.246077
## G3          1.0026768 0.8475937 1.190283

All the above can be interpreted as follows:
1. ORs for sex and G3 are more than 1, which means there is an association between sex and alcohol, and final grade and alcohol.
2. ORs for G1 and G2 are less than 1, which means no association between first or second grade and alcohol.
Therefore, my 1st and 4th hypotheses are confirmed.

Predictive power

I am only using the “sex” and the “G3” now.

m <- glm(high_use ~ sex + G3, data = joineddata, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + G3, family = "binomial", data = joineddata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4158  -0.8453  -0.6698   1.2796   1.9827  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.33776    0.41861  -0.807 0.419757    
## sexM         0.88254    0.23246   3.796 0.000147 ***
## G3          -0.08688    0.03466  -2.507 0.012189 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 444.55  on 379  degrees of freedom
## AIC: 450.55
## 
## Number of Fisher Scoring iterations: 4

Both variables are significant.

probabilities <- predict(m, type = "response")
joineddata <- mutate(joineddata, probability = probabilities)
joineddata <- mutate(joineddata, prediction = probability > 0.5)
select(joineddata, sex, G3, high_use, probability, prediction) %>% tail(10)
##     sex G3 high_use probability prediction
## 373   M  0    FALSE   0.6329243       TRUE
## 374   M  2     TRUE   0.5917072       TRUE
## 375   F 12    FALSE   0.2009626      FALSE
## 376   F  8    FALSE   0.2625459      FALSE
## 377   F  5    FALSE   0.3160153      FALSE
## 378   F 12    FALSE   0.2009626      FALSE
## 379   F  4    FALSE   0.3350868      FALSE
## 380   F  4    FALSE   0.3350868      FALSE
## 381   M 13     TRUE   0.3578685      FALSE
## 382   M 10     TRUE   0.4197026      FALSE

Now lets do 2x2 cross tabulation:

table(high_use = joineddata$high_use, prediction = joineddata$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE    104   10

And graphic representation:

g <- ggplot(joineddata, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

Now we can explore the training error:

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = joineddata$high_use, prob = joineddata$probability)
## [1] 0.2984293

Proportion of incorrect predictions (training error) is almost 30%.

If I compare this model’s performance to my simple guessing strategy, the model worked out better.

BONUS: 10-fold cross-validation

cv <- cv.glm(data = joineddata, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.3036649

The prediction error is 0.30 - it’s a bit worse than DataCamp’s model error.


R Studio Exercise 4: Clustering and classification

November 22, 2019
This week I worked with Boston dataset.

Playing with the dataset

Loading the Boston data and exploring the structure and dimensions:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data('Boston')
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass.
It has 506 observations (rows) and 14 variables (columns).

Graphical overview of the data:

pairs(Boston)

We can see scatterplots of the relationships of data’s variables.
What we are looking here are any linear or U-shaped patterns, that might tell us about some dependancies. For instance, we can see some correlation patterns between “nox” and “dis”, “lstat” and “medv”, “lstat” and “rm”, “medv” and “rm”.
nox - nitric oxides concentration
dis - weighted distances to five Boston employment centres
lstat - % lower status of the population
medv - median value of owner-occupied homes in $1000’s
rm - average number of rooms per dwelling

Summaries of the variables:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can see that the “chas” variable looks very weird, but it is due to it being integer (0 and 1).
Also, the “black” variable clearly has an outlier.
chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
black - proportion of blacks by town

Scaling the dataset:

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled) #making our object a dataframe

The variables changed. We subtracted the column means from the corresponding columns and divided the difference with standard deviation. Now all variables means are zeros.

Creating a categorical variable:

bins <- quantile(boston_scaled$crim) #using the quantiles as the break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high")) #creating a variable
boston_scaled <- dplyr::select(boston_scaled, -crim) #dropping the old crime variable
boston_scaled <- data.frame(boston_scaled, crime) #adding the new crime variable to the dataset

Now lets divide the dataset to train and test ones:

n <- nrow(boston_scaled) #number of rows
ind <- sample(n,  size = n * 0.8) #random choice of 80% of rows
train <- boston_scaled[ind,] #train set
test <- boston_scaled[-ind,] #test set

Linear Discriminant Analysis

Lets fit the LDA on the train set:

lda.fit <- lda(crime ~ ., data = train) #crime is a target variable and all others are predictors

(Bi)Plotting the LDA:

classes <- as.numeric(train$crime) #we are creating a numeric vector of the train sets crime classes for plotting purposes
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1) #plotting the LDA

Now lets do some changes:

correct_classes <- test$crime #saving the correct classes from test data
test <- dplyr::select(test, -crime) #removing the crime variable from test data

We can now predict the classes on the test data:

lda.pred <- predict(lda.fit, newdata = test)

We can also cross-tabulate the results:

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10       7        1    0
##   med_low    5      15        7    0
##   med_high   0      11       17    2
##   high       0       0        1   26

I don’t think the LDA model predicted the classes well. It made some major mistakes regarding med_low and med_high predictions.

K-means and other stuff

For a start:

library(MASS)
data('Boston')
Boston2 <- scale(Boston) #standartizing (scaling) a dataset
Boston2 <- as.data.frame(Boston2) #saving as dataframe
dist_eu <- dist(Boston2) #calculating distances between observations

Now lets run k-means:

km <- kmeans(x = Boston2, centers = 3)

Now we can determine the optimal number of clusters:

set.seed(123) #this function prevents k-means from randomly assigning the initial cluster centers
k_max <- 10 #let the maximum amount of clusters be 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston2, k)$tot.withinss}) #calculating the total sum of squares
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line') #visualizing the clusters

The total WCSS changes radically around 2, which means 2 is an optimal number of clusters.
Lets run the code again and plot it:

km <-kmeans(Boston2, centers = 2)
pairs(Boston2, col = km$cluster)


R Studio Exercise 5: Dimensionality reduction techniques

November 29, 2019
This week I performed the dimensionality reduction techniques on “human” data.

Playing with the dataset

Loading the “human” data:

library(MASS)
human <- read.table("human.csv")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##                                                                   
##       GNI         Mat.Mor         Ado.Birth         Parli.F     
##  1,123  :  1   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1,228  :  1   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  1,428  :  1   Median :  49.0   Median : 33.60   Median :19.30  
##  1,458  :  1   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  1,507  :  1   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  1,583  :  1   Max.   :1100.0   Max.   :204.80   Max.   :57.50  
##  (Other):149

It is a United Nations dataset for Human Development Index in different countries depending on such parameters as education, labour, life expectancy and others.
The dataset has 195 observations (rows) and 19 variables (columns).

Let’s look at a graph:

pairs(human)

We can see scatterplots of the relationships of data’s variables.
We can see some U-shaped patterns and a lot of linear patterns that show that the variables are probably related. Making examples here would be exuberating, so lets proceed to look at statistics as it is.

Principal component analysis

Non-standardized variables

library(dplyr)
library(stringr)
human <- mutate(human, GNI = str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric)
pca_human <- prcomp(human)

Variability captured by PCA:

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Now let’s draw a biplot:

biplot(pca_human, xlab = pca_human$PC1, ylab = pca_human$PC2, main = "Non-standardized variables: GNI and others")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

It looks pathetic. Lets try to scale the variables now.

Standardized variables

human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_std, col = c("grey40", "deeppink2"), xlab = pca_human_std$PC1, ylab = pca_human_std$PC2, main = "Standardized variables: emostly ducation and labour")

Looks much better! (We’ve also changed the color).
The scaling normalizes the range of variables, so it looks more neat.

Interpretations

Based on a scaled bi-plot, here is what we see:

  1. The angle between features represents high correlation between them.
  • Expected years of schooling (Edu.Exp), Life expectancy at birth (Life.Exp), Proportion of females and males with at least secondary education (Edu2.FM) and Gross National Income per capita (GNI) are highly positively correlated.
  • Percentage of female representatives in parliament (Parli.F) and Proportion of females and males in the labour force (Labo.FM) are also highly positively correlated. However, this correlation is smaller than (a) and (c).
  • Adolescent birth rate (Ado.Birth) and Maternal mortality ratio (Mat.Mor) are also highly positively correlated.
  1. The angle between the feature and the PC axis represents correlation between the two. If PC and the feature arrow are “in parallel”, it means, they are contributing to the same dimension.
  • Percentage of female representatives in parliament (Parli.F) and Proportion of females and males in the labour force (Labo.FM) contribute to the PC2.
  • All other features contribute to PC1.

Tea dataset

library(FactoMineR)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

Now lets select columns to do MCA:

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))

Time for MCA!

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Interpretation:

The table contains the Eigenvalues, the Individuals, Categories and Categorical variables.

Let look at categorical variables: if the value is close to 1, then it means a strong link with the variable and dimension. We can see it here in “where” and “Dim2” and “Dim1”. Also in “how” and “Dim1”.

Now lets draw an MCA variable biplot:

plot(mca, invisible=c("var"))

Interpretation:

Variables are shown here on the first two dimensions. We can look at possible variable patterns: the distances give measures of variables similarity: the closer - the better.

For me, the variable plot makes no sense, so I prefer to draw an individuals plot.

plot(mca, invisible=c("ind"))

Here, for example, “unpackaged” and “tea shop” are more similar, than “lemon” and “black”.